ABSTRACT


In STEM domains, students are expected to acquire domain 
knowledge from visual representations that they may not 
yet be able to interpret. Such learning requires perceptual 
fluency: the ability to intuitively and rapidly see which con- 
cepts visuals show and to translate among multiple visuals. 
Instructional problems that engage students in nonverbal, 
implicit learning processes enhance perceptual fluency. Such 
processes are highly influenced by sequence effects. Thus far, 
we lack a principled approach for identifying a sequence of 
perceptual-fluency problems that promote robust learning. 
Here, we describe a novel educational data mining approach 
that uses machine learning to generate an optimal sequence 
of visuals for perceptual-fluency problems. In a human ex- 
periment, we show that a machine-generated sequence out- 
performs both a random sequence and a sequence gener- 
ated by a human domain expert. Interestingly, the machine- 
generated sequence resulted in significantly lower accuracy 
during training, but higher posttest accuracy. This suggests 
that the machine-generated sequence induced desirable diffi- 
culties. To our knowledge, our study is the first to show that 
an educational data mining approach can induce desirable 
difficulties for perceptual learning. 
1. INTRODUCTION


Visual representations are ubiquitous instructional tools in
science, technology, engineering, and math (STEM) domains
[23]. For example, chemistry instruction on bonding typically
includes the visuals shown in Figure 1. While we typically
assume that such visuals help students learn because they
make abstract concepts more accessible, they can also impede
students’ learning if students do not know how the visuals
show information [27]. To successfully use visuals to learn
new domain knowledge, students need representational
competencies: knowledge about how visual representations show
information [i]. For example, a chemistry student needs to 
learn that the dots in the Lewis structure in Figure 1(a) show 
electrons and that the spheres in the space-filling model in 
Figure 1(b) show regions where electrons likely reside. 




Figure 1: Two commonly used visual representations of wa- 
ter (a: Lewis structure; b: space-filling model). 


Most instructional interventions that help students acquire 
representational competencies focus on conceptual represen- 
tational competencies. These include the ability to map 
visual features to concepts, support conceptual reasoning 
with visuals, and choose appropriate visuals to illustrate a 
given concept [5]. For example, chemists can explain how 
the number of lines and dots shown in the Lewis structure 
relate to the colored spheres in the space-filling model by 
relating these visual features to chemical bonding concepts. 
Such conceptual representational competencies are acquired 
via explicit, verbally mediated learning processes that are 
best supported by prompting students to explain how visuals
show concepts [20][27].


Less research has focused on a second type of representational
competency: perceptual fluency. It involves the
ability to rapidly and effortlessly see meaningful information
in visual representations. For example, chemists
immediately see that both visuals in Figure 1 show water
without having to effortfully think about what the visuals
show. They are as fluent at seeing meaning in multiple visuals
as bilinguals are fluent in hearing meaning in multiple
languages. Perceptual fluency frees up cognitive resources
for higher-order complex reasoning, thereby allowing students
to use visuals to learn new domain knowledge [16][27].


Students acquire perceptual fluency via implicit inductive
processes [12][14]. These processes are nonverbal because
verbal reasoning is not necessary and may even interfere
with the acquisition of perceptual fluency [20]. Consequently,
instructional problems that enhance perceptual
fluency ask students to quickly judge what a visual shows
[19]. For example, one type of perceptual-fluency problem
may ask students to quickly and intuitively
judge whether two visuals like the ones in Figure 1 show
the same molecule. Such problems ask students to rely on
implicit intuitions when responding. Students typically
receive numerous perceptual-fluency problems in a row.
The problem sequence is typically chosen so that (1) students
are exposed to a variety of visuals and (2) consecutive
visuals vary incidental features while drawing students’
attention to conceptually relevant features.


However, these general principles are underspecified in the 
sense that they leave room for many possible problem se- 
quences. To date, we lack a principled approach capable 
of identifying sequences of visual representations that yield 
optimal learning outcomes for perceptual-fluency problems. 
To address this issue, we developed a novel educational data 
mining approach. Using data from human students who 
learned with perceptual-fluency problems, we trained a ma- 
chine learning algorithm to mimic human perceptual learn- 
ing. Then, we used an algorithm to search over possible 
sequences of visual representations to identify the sequence 
that was most effective for a machine learning algorithm. 
In a human experiment, we then tested whether (1) the 
machine-selected sequence of visual representations yielded 
higher learning outcomes compared to (2) a random se- 
quence and (3) a sequence generated by a human expert 
based on perceptual learning principles. 


In the following, we first review relevant literature on learn- 
ing with visual representations, perceptual fluency, and our 
machine learning paradigm. Then, we describe the methods 
we used to identify the machine-selected sequence and the 
methods for the human experiment. We also discuss how 
our results may guide educational interventions for repre- 
sentational competencies and educational data mining more 
broadly. 


2. PRIOR RESEARCH 


2.1 Learning with Visual Representations 

Theories of learning with visual representations define visual 
representations as a specific type of external representation. 
External representations are objects that stand for some- 
thing other than themselves — a referent [25]. When we see 
an image of a pizza, for example, the referent could be a slice 
of pizza (a concrete object). Alternatively, when used in the 
context of math instruction, the referent could be a fraction 
of a whole pizza (an abstract concept). Representations used 
in instructional materials are defined as external representa- 
tions because they are external to the viewer. By contrast, 
internal representations are mental objects that students can 
imagine and mentally manipulate. Internal representations 
are the building blocks of mental models; these models con- 
stitute students’ content knowledge of a particular topic or 
domain. External representations can be symbolic or visual. 
For instance, text or equations are symbolic external repre- 
sentations that consist of symbols that have arbitrary (or 
convention-based) mappings to the referent [32]. By contrast,
visual representations have similarity-based mappings
to the referent [32].


Several theories describe how students learn from visual rep- 
resentations. Mayer’s Cognitive Theory of Multimedia 
Learning (CTML) and Schnotz’s Integrated Model of 
Text and Picture Comprehension (ITPC) draw on information
processing theory [4] to describe learning from external
representations as the integration of new information into a 
mental model of the domain knowledge. Here, we focus on 
learning processes relevant to visual representations. 


First, students select relevant sensory information from the 
visual representations for further processing in working mem- 
ory. To this end, students use perceptual processes that 
capture visuo-spatial patterns of the representation in working
memory [32]. To willfully direct their attention to relevant
visual features, students draw on conceptual competencies
that enable top-down thematic selection of visual
features.


Second, students organize this information into an internal 
representation that describes or depicts the information presented
in the external representation. Because visual representations
have similarity-based analog mappings to referents,
their structure can be directly mapped to the analog
internal representations [10][32]. In forming the internal
representation, students engage perceptual processes that 
draw on pattern recognition of objects based on visual cues. 
They engage conceptual processes to map the visual cues 
to conceptual representational competencies that allow the 
retrieval of concepts associated with these objects. The re- 
sulting internal representation is a perceptual analog of the 
visual representation. It is depictive in that its organization 
directly corresponds to the visuo-spatial organization of the 
external visual representation [32]. 


Third, students integrate the information contained in the 
internal representations into a mental model of the domain 
knowledge (e.g., schemas, category knowledge). To this end, 
students integrate the analog internal representation into a 
mental model by mapping the analog features to information
in long-term memory. This third step is what constitutes
learning: students learn by integrating internal representations
into a coherent mental model of the domain
knowledge.


In sum, students’ learning from visual representations hinges 
on their ability to form accurate internal representations of 
the representations’ referents and on their ability to inte- 
grate internal representations into a coherent mental model 
of the domain knowledge. This process involves both con- 
ceptual and perceptual competencies [27]. Although it is 
well established that conceptual and perceptual competencies
are interrelated [16][17], it makes sense to distinguish
them because they are acquired via qualitatively different
learning processes [16][19][20]. As mentioned earlier, conceptual
representational competencies are acquired via verbally
mediated, explicit processes [20][27]. By contrast, perceptual
fluency is acquired via implicit, mostly nonverbal processes. 
Whereas most prior research on instructional interventions 
for representational competencies has focused on conceptual 
processes, we focus on perceptual processes. 
2.2 Perceptual Fluency

Research on perceptual fluency is based on findings that ex- 
perts can automatically see meaningful connections among 
representations, that it takes them little cognitive effort to 
translate among representations, and that they can quickly 
and effortlessly integrate information distributed across rep- 
resentations [12]. For example, experts can see “at a glance” 
that the Lewis structure in Figure 1(a) shows the same 
molecule as the space-filling model in Figure 1(b). Such 
perceptual fluency frees cognitive resources for explanation- 
based reasoning and is considered an important goal 
in STEM education. 


According to the CTML and the ITPC, perceptual fluency
involves efficient formation of accurate internal representations
of visual representations [22][32]. Perceptual fluency
also involves the ability to combine information from different
visual representations without any perceived mental
effort and to quickly translate among them [19]. According
to the CTML and the ITPC, this allows students to map
analog internal representations of multiple visual representations
to one another [22][32].


Cognitive science literature suggests that students
acquire perceptual fluency via perceptual-induction
processes. These processes are inductive because students
can infer how visual features map to concepts through
experience with many examples [12][15][19]. Students gain
efficiency in seeing meaning in visuals via perceptual chunking.
Rather than mapping specific analog features to concepts,
students learn to treat each analog visual as one perceptual
chunk that relates to multiple concepts. Perceptual-induction
processes are thought to be nonverbal because
they do not require explicit reasoning [20]. They are implicit
because they occur unintentionally and sometimes
unconsciously [33].


Interventions that target perceptual fluency are relatively 
novel. Kellman and colleagues developed interventions 
that engage students in perceptual-induction processes by 
exposing them to many short problems where they have to 
rapidly translate between representations. For example, stu- 
dents might receive numerous problems that ask them to 
judge whether two visuals like the ones shown in Figure 1 
show the same molecule. These interventions have enhanced 
students’ learning in domains like chemistry [30][36].


Perceptual learning is strongly affected by problem sequences
[27]. In well-designed problem sequences, consecutive
problems expose students to systematic variation (often in
the form of contrasting cases) so that irrelevant features vary
but relevant features appear across several problems [19].
However, a vital issue remains when designing problem se- 
quences for perceptual-fluency problems: Visual represen- 
tations differ on a large number of visual features. Con- 
sequently, countless potential problem sequences exist that 
systematically vary these visual features. How do we know 
which sequence is most effective? To address this issue, we 
propose a new educational data mining approach that draws 
on Zhu’s machine-teaching paradigm.


2.3 Machine Teaching Paradigm 


Simply put, machine teaching is the inverse problem of ma- 
chine learning. Machine learning refers to computer algo- 
rithms that select an optimal model for a given set of data. 
In other words, it determines which model fits the data 
best. Machine teaching, on the other hand, finds the op- 
timal (smallest) set of data for training such that a given 
algorithm selects a target model. Although the machine 
teaching paradigm has been applied to cognitive psychology 
and education [24], it has not yet been used in educational 
data mining research. 


Machine teaching requires a cognitive model, i.e., a learning
algorithm that mimics how human students learn a mapping
between visual representations like the ones shown in
Figure 1. Given the cognitive model, machine teaching seeks
a sequence of learning problems (optimal training sequence
$O$) such that when given $O$, the learning algorithm learns
the mapping. Here, $O$ need not be independent and identically
distributed (i.i.d.). Machine teaching can be viewed as
a communication problem between a teacher and a student: 
The goal is to communicate the mapping using the short- 
est message. The channel only allows messages in the form 
of a training sequence and the student decodes the message 
with the learning algorithm. In perceptual learning, stu- 
dents learn a mapping between visual features of two types 
of visual representations, allowing them to fluently translate 
among the visual representations. 


To evaluate whether a training sequence is effective, we test
the cognitive model’s performance at mapping visual representations
using a different set of perceptual-fluency problems
than those used during training. Typically, a sequence of
training problems (aka training instances in machine learning)
is drawn from a distribution of perceptual-fluency problems
used for training ($P_t$). The set of test problems comes
from a separate distribution of perceptual-fluency problems
($P_e$). The goal is to minimize the test error rate on $P_e$. The
goal of machine teaching then becomes:


$$O = \operatorname*{argmin}_{S \in C_t} \; \mathbb{P}_{(x,y) \sim P_e}\left(A(S)(x) \neq y\right) \tag{1}$$


Here, $C_t$ is the set of all possible training sequences and $A(S)$
is the learned hypothesis after training on the sequence $S$.
Note that $O$ is not necessarily an i.i.d. sequence drawn
from $P_t$. One practical approach to approximately solve the
optimization problem is shown in Algorithm 1. To properly
construct the optimal training sequence in this given setting,
we must understand:


1. the nature of the to-be-learned domain knowledge 


2. the learning algorithm the cognitive model is using 


In this paper, the to-be-learned domain knowledge is well- 
known. It is the mappings between visual representations 
that students have to learn. Further, we used data from hu- 
man students learning from perceptual-fluency problems to 
generate a cognitive model that mimics how humans learn 
mappings between visual representations. Our goal is to 
investigate whether, when the mappings and the cognitive 
model are well understood, machine teaching can identify
a training set that is more effective than (a) a problem se- 
quence based on perceptual learning principles and (b) a 
random sequence. 


Algorithm 1 Machine Teaching


Input: Learner A, Test Distribution P_e
O ← starting sequence
e_best ← error(train(A, O), P_e)
while TRUE do
    N ← neighbors(O), e_old ← e_best
    for S ∈ N do
        e ← error(train(A, S), P_e)
        if e < e_best then
            e_best ← e, O ← S
        end if
    end for
    if e_best = e_old then
        return O
    end if
end while
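
The greedy search in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: `hill_climb`, `toy_error`, and `toy_neighbors` are hypothetical names, and `toy_error` merely stands in for training the learner on a sequence and measuring its error on the test distribution.

```python
from typing import Callable, Iterable, List, Tuple

Seq = Tuple[int, ...]  # a training sequence, encoded here as problem ids


def hill_climb(start: Seq,
               neighbors: Callable[[Seq], Iterable[Seq]],
               error: Callable[[Seq], float]) -> Tuple[Seq, float]:
    """Greedy local search in the spirit of Algorithm 1.

    error(S) plays the role of error(train(A, S), P_e).
    """
    best, e_best = start, error(start)
    while True:
        e_old = e_best
        # The neighborhood is generated once per outer iteration.
        for cand in neighbors(best):
            e = error(cand)
            if e < e_best:
                e_best, best = e, cand
        if e_best == e_old:  # no neighbor improved: local optimum
            return best, e_best


# Toy stand-ins (assumptions): the "error" is the distance to a hidden target.
TARGET: Seq = (3, 1, 4)


def toy_error(seq: Seq) -> float:
    return float(sum(abs(a - b) for a, b in zip(seq, TARGET)))


def toy_neighbors(seq: Seq) -> List[Seq]:
    # All sequences that differ from seq in exactly one position.
    out: List[Seq] = []
    for i in range(len(seq)):
        for v in range(5):
            if v != seq[i]:
                out.append(seq[:i] + (v,) + seq[i + 1:])
    return out


best_seq, best_err = hill_climb((0, 0, 0), toy_neighbors, toy_error)
print(best_seq, best_err)
```

In this toy setting the search reaches the target because the error is separable across positions; with a trained learner as the error function, the procedure only guarantees a local optimum.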


3. COGNITIVE MODEL 


We now describe how we constructed the cognitive model 
that was used to construct the training sequence. To this 
end, we first describe the perceptual-fluency problems, then 
describe how we formally represented these problems, which 
learning algorithm the cognitive model used, and finally how 
we used the cognitive model to identify the optimal training 
sequence. 


3.1 Perceptual-Fluency Problems 
Perceptual-fluency problems are single-step problems that 
ask students to make simple perceptual judgments. In our 
case, students were asked to judge whether two visual representations
showed the same molecule, as shown in Figure 2.
Students were given two images. One image was of
a molecule represented by a Lewis structure and the other 
image was a molecule represented by a space-filling model. 
They were asked to judge whether those two images show 
the same molecule or not. 


[Figure 2 panel: "Are the following two molecules the same?" Response options: Yes / No]


Figure 2: In this sample perceptual-fluency problem, stu- 
dents judged whether or not the Lewis structure and the 
space-filling model showed the same molecule. The answer 
is yes. 


3.2 Visual Representation of Molecules 

In our experiment, we used visual representations of chem- 
ical molecules common in undergraduate instruction. To 
identify these molecules, we reviewed textbooks and web- 
based instructional materials. We counted the frequency of 
different molecules using their chemical names (e.g., H2O) 
and common names (e.g., water), and chose the 142 most 
common molecules. In order to formally describe the visual 
representations, we quantified visual features for each of the 
molecules. To this end, we first hand-coded the visual fea- 
tures that were present in the visual representations. For 
Lewis structures, these hand-coded features included counts 
of individual letters as well as information about different 
bonds present in each molecule, among others. For space-filling
models, hand-coded features included counts of colored
spheres, bonds, and other features. Further, we included
several surface features that we expect human students
to attend to, based on findings that humans tend to focus
on broader surface features that are easily perceivable. Then
we used a previously published method to determine which subset
of features (separately for Lewis structures and space-filling models)
humans attend to most. Building on these results, we created
feature vectors for each of the molecules (Figure 3).
These feature vectors of Lewis structures and space-filling 
models contained 27 and 24 features, respectively. These 
feature vectors were then used to train and test the learning 
algorithm. 


(a) Lewis structure: one feature vector per molecule (142 molecules)

Feature                     | H2O | CO2
Number of connections       | 2   | 2
Number of different letters | 2   | 2
Number of total letters     | 3   | 3
Number of single lines      | 2   | 4

(b) Space-filling model: one feature vector per molecule (142 molecules)

Feature                     | H2O | CO2
Number of sphere colors     | 2   | 2
Number of total spheres     | 3   | 3
Number of black-red bonds   | 0   | 2


Figure 3: Example features for H2O and CO2 molecule representations
with feature vectors in red (a: Lewis structure;
b: space-filling model).


3.3 Learning Algorithm

We used a feed-forward artificial neural network (ANN)
as our learning algorithm. ANNs are inspired by biological
neural networks. A biological neuron produces an output
when the collective effect of its inputs reaches a certain threshold.
It is still not clear exactly how the human brain learns,
but one assumption is that learning is associated with the
interconnections between neurons. ANNs try to model this
low-level functionality of the brain. We chose an ANN as
our learning algorithm because of this similarity. Our ANN took
two feature vectors ($x_1$ and $x_2$) as input. Each feature vector
corresponded to one of the two molecules shown. Given
this input, the ANN produced a probability that the two
molecules were the same. Then, given the correct answer
$y \in \{0, 1\}$ (here 1 means the two molecules are the same),
the ANN updated its weights using the backpropagation
algorithm. The backpropagation algorithm uses gradients of the
error to move the weights toward a (local) optimum. Algorithm 2
shows the training procedure of the neural network. It shows that
the update procedure also used a history window and multiple
backpropagation passes, an atypical approach for an ANN. We introduced
these two measures to address the issue that regular ANN
algorithms do not learn from memory like humans do. First, we
assumed that humans remember a fixed number of past consecutive
problems. Second, we assumed that after receiving
feedback on the latest problem, humans update their internal
model by reviewing memorized problems (along with the
latest problem) several times. The history window emulates the
first assumption, and the multiple backpropagation passes emulate
the second. This procedure was followed for all problems in
a given training sequence.


Algorithm 2 train: training method for the NN learner


Input: Training sequence S, Learning rate η, History window size w, Number of backpropagations b
H ← []  // initialize an empty history window
for i = 1 → |S| do
    append(H, S[i])  // update history window
    // train on the history window
    w′ ← |H|
    for k = 1 → b do
        for j = 1 → w′ do
            (x, y) ← H[j]
            backprop(x, y, η)
        end for
    end for
    // check history window size
    if w′ > w then
        H.remove(0)  // remove the oldest instance in history
    end if
end for
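
A minimal Python sketch of the history-window training loop in Algorithm 2 follows. It is an illustration under stated assumptions, not the study's code: `ToyLearner` is a hypothetical logistic-regression stand-in for the ANN, its `backprop` method is a single gradient step on the log loss, and the example data are made up.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[List[float], int]  # (feature vector x, label y)


class ToyLearner:
    """Hypothetical stand-in for the ANN: one-layer logistic regression."""

    def __init__(self, n_features: int) -> None:
        self.w = [0.0] * n_features
        self.b = 0.0
        self.updates = 0  # counts gradient steps, for inspection only

    def predict_proba(self, x: Sequence[float]) -> float:
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def backprop(self, x: Sequence[float], y: int, lr: float) -> None:
        g = self.predict_proba(x) - y  # gradient of log loss w.r.t. z
        self.w = [wi - lr * g * xi for wi, xi in zip(self.w, x)]
        self.b -= lr * g
        self.updates += 1


def train(learner: ToyLearner, S: Sequence[Example],
          lr: float, w: int, b: int) -> ToyLearner:
    """Replay a sliding window of recent problems b times per new problem."""
    H: List[Example] = []             # history window
    for example in S:
        H.append(example)             # update history window
        w_prime = len(H)
        for _ in range(b):            # b passes over the window
            for x, y in H:
                learner.backprop(x, y, lr)
        if w_prime > w:               # size check happens after training,
            H.pop(0)                  # as in the listing above
    return learner


# Made-up, linearly separable data: positive x means "same" (y = 1).
S = [([1.0], 1), ([-1.0], 0), ([2.0], 1), ([-2.0], 0), ([1.5], 1)]
learner = train(ToyLearner(1), S, lr=0.1, w=2, b=3)
print(learner.updates, learner.predict_proba([1.0]))
```

Because the size check follows the training passes, the window briefly holds w + 1 problems once it is full; whether that matches the intended behavior cannot be determined from the listing alone.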


A further, structural difference between our learning algorithm
and a general artificial neural network is that our
learning algorithm had two separate weight columns (one
for each representation of the input molecules). The model
architecture of the ANN is shown in Figure 4. Here, the
weights and outputs from one of the columns did not interact
with those of the other column until the output layer.
The network mapped the two inputs (feature vectors $x_1$ and
$x_2$) to a space wherein different representations of the same
molecule are close to each other while different
molecules are distant. These mapping functions are called
embedding functions (one for each representation) and the
space is called a common embedding space. Once the mapping
was complete, a judgment was possible regarding the
similarity of the input molecules. This judgment was based
on the distance in the common embedding space and made
in the output layer of the ANN. Embeddings were generated
in the layer before the output layer, the embedding layer.


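[Figure 4 labels, from input to output: two input feature vectors; one linear transformation per column; Embedding Layer 1 and Embedding Layer 2, producing $l_1$ and $l_2$; output $\exp(-\lVert l_1 - l_2 \rVert / a)$]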


Figure 4: Structure of the Artificial Neural Network learning 
algorithm 


Neurons in an ANN use a non-linear function, called an activation
function, to introduce non-linearity. For all hidden layers
before the embedding layer, we used the leaky rectifier
activation function (a neuron employing the leaky rectifier is
called a leaky rectified linear unit, or leaky ReLU). A standard
rectified linear unit (ReLU) passes only positive inputs
onward (and outputs 0 otherwise). A leaky ReLU, on
the other hand, outputs the input scaled by a small factor when
the input is negative. Both ReLU and leaky ReLU have strong
biological motivations. According to cognitive neuroscience studies
of human brains, neurons encode information in a sparse
and distributed fashion [3]. Using ReLU, ANNs can also
encode information sparsely. Besides this biological plausibility,
sparsity also confers mathematical benefits like information
disentangling and linear separability. Rectified
linear units also enable better training of ANNs [13]. The
embedding layers, by contrast, do not use activation functions.
Hence, the output of each embedding layer is a linear
transformation of its inputs. Given the inputs $(x_1, x_2)$, let
the ANN-generated embeddings be $l_1$ and $l_2$, respectively.
Then, we computed the probability of the two representations
showing the same molecule in the output layer using
the following equation:


$$\exp\left(-\frac{\lVert l_1 - l_2 \rVert}{a}\right) \tag{2}$$


Here, $a$ is a trainable parameter that the ANN learns along
with the weights. We thresholded this value at 0.5 to generate
the ANN prediction $\hat{y} \in \{0, 1\}$.
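
The forward pass of the two-column network can be sketched as below. This is only an illustrative reading of the description above: the function names, the leaky slope of 0.01, the absence of bias terms, the Euclidean norm, the tie-breaking at exactly 0.5, and all weights and inputs are assumptions, not details reported in the paper.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def leaky_relu(v: Vector, slope: float = 0.01) -> Vector:
    # Positive inputs pass through; negative inputs are scaled down.
    return [x if x > 0 else slope * x for x in v]


def linear(W: Matrix, v: Vector) -> Vector:
    # Matrix-vector product (no bias term in this sketch).
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def embed(hidden: List[Matrix], W_embed: Matrix, x: Vector) -> Vector:
    """One column: leaky-ReLU hidden layers, then a linear embedding layer."""
    h = x
    for W in hidden:
        h = leaky_relu(linear(W, h))
    return linear(W_embed, h)  # embedding layer has no activation


def same_probability(l1: Vector, l2: Vector, a: float) -> float:
    """exp(-||l1 - l2|| / a), with a > 0 (Euclidean norm assumed)."""
    dist = math.sqrt(sum((u - v) ** 2 for u, v in zip(l1, l2)))
    return math.exp(-dist / a)


def predict(l1: Vector, l2: Vector, a: float) -> int:
    # Threshold the probability at 0.5 to obtain a 0/1 prediction.
    return 1 if same_probability(l1, l2, a) >= 0.5 else 0


# Made-up weights; the two columns accept inputs of different lengths.
col1_hidden = [[[1.0, 0.0], [0.0, 1.0]]]
col2_hidden = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]]
W_embed = [[1.0, 1.0]]

l1 = embed(col1_hidden, W_embed, [1.0, -2.0])
l2 = embed(col2_hidden, W_embed, [0.5, 0.5, 9.0])
p = same_probability(l1, l2, a=1.0)
print(p, predict(l1, l2, a=1.0))
```

Keeping the columns separate until the distance computation mirrors the architecture described above, where the two representations interact only in the output layer.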


3.4 Pilot Study - Train the Learning Algorithm 
Our first step was to train the learning algorithm to mimic 
human perceptual learning. To this end, we conducted a 
pilot experiment to find a good set of hyperparameters for 
the ANN learning algorithm. Hyperparameters of an ANN 
are variables that are set before optimizing the weights (e.g., 
number of hidden layers, number of neurons in each layer,
learning rate, etc.). Our goal was to identify hyperparameters
that make predictions matching human behavior on the
posttest. Hence, we matched the algorithm’s predictions to 
summary statistics of human performance on the posttest. 


Our pilot experiment included 47 undergraduate chemistry 
students. They were randomly assigned to one of two con- 
ditions that used a random training sequence: supervised 
training (n = 35) or unsupervised training (n = 12). Partic- 
ipants in the supervised training condition received feedback 
after each training problem, whereas participants in the
unsupervised condition did not receive feedback. We included
the unsupervised training condition to generate an evaluation
set for pretraining the ANN learning algorithm; this set was
used to determine when pretraining had succeeded.


Let there be n supervised human participants. Each par- 
ticipant received a random pretest set, a random training 
sequence, and a random posttest set. We trained the ANN 
learning algorithm n times independently (once for each par- 
ticipant). While training for the i-th time we used the train- 
ing sequence viewed by the i-th supervised human partici- 
pant. The same posttest set viewed by this participant was 
also used to evaluate the performance of the ANN learning 
algorithm after training. Let the error on this posttest set
for the i-th human participant and the trained ANN learning
algorithm be $p_{p_i}$ and $p_{n_i}$, respectively. Then, Equation 3 is
a measure used to determine whether or not an ANN learning
algorithm’s performance is comparable to the average
human. Note that lower error rates are desirable.


$$\text{error rates} = \left| \frac{1}{n} \sum_{i=1}^{n} p_{p_i} - \frac{1}{n} \sum_{i=1}^{n} p_{n_i} \right| \tag{3}$$
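
As a small worked example of the comparison in Equation 3, the sketch below assumes the measure is the absolute difference between the mean human posttest error and the mean ANN posttest error. The function name `mean_error_gap` and the error values are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def mean_error_gap(human_errors: Sequence[float],
                   model_errors: Sequence[float]) -> float:
    """Absolute difference between mean human and mean model error rates."""
    if not human_errors or len(human_errors) != len(model_errors):
        raise ValueError("need one model error per human participant")
    n = len(human_errors)
    return abs(sum(human_errors) / n - sum(model_errors) / n)


# Hypothetical posttest error rates for n = 3 participants and the
# three ANN instances trained on their respective sequences.
human = [0.10, 0.20, 0.00]
model = [0.20, 0.10, 0.30]
gap = mean_error_gap(human, model)
print(round(gap, 3))
```

Here the human mean is 0.1 and the model mean is 0.2, so the gap is 0.1; a hyperparameter setting with a smaller gap would match average human performance more closely.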


Table 1 reports the accuracies of participants in the pilot
experiment.

Table 1: Accuracy in Pilot Experiment by Training Condition.
Average pretest, training, and posttest accuracy with
SEM in parentheses.


Condition    | Pretest    | Training   | Posttest
Supervised   | 79.9 (1.8) | 75.7 (1.2) | 89.4 (1.4)
Unsupervised | 77.9 (3.4) | 78.5 (2.8) | 77.1 (3.3)


We note that humans usually have some degree of prior
knowledge about chemistry. By contrast, the weights of an
ANN are generally initialized at random. We addressed this
issue by modeling the effect of prior knowledge: specifically,
we introduced a pretraining phase for the ANN learning
algorithm. To this end, we drew a large sample of instances
(10000) from the combined test and training distribution
($\tfrac{1}{2}P_e + \tfrac{1}{2}P_t$) to form a pretraining set. Further, we
combined the pretest problems across both the supervised and
unsupervised conditions, along with the training problems
in the unsupervised condition, to form the pretraining
evaluation set. Because we did not provide feedback for these
problems, we assumed that the participants did not learn
anything new while going through them. Formally, let
participants’ error on the pretraining evaluation set be called
human pretraining error. We then trained the ANN learning
algorithm on the pretraining set. Note that an ANN
can train over the same set for multiple iterations
(formally known as epochs). We trained the ANN learning
algorithm until its error on the pretraining evaluation set
was smaller than human pretraining error. This concluded
the pretraining phase.


We used standard coordinate descent with random restart 
to find a good hyperparameter set. Coordinate descent suc- 
cessively minimizes the error rates along the coordinate di- 
rections (e.g., embedding size, learning rate). At each iter- 
ation, the algorithm chooses one particular coordinate di- 
rection while fixing the other values. Then, it minimizes in 
the chosen coordinate direction. Table 2 shows the hyperparameter
values that we explored, along with the best
value found for each. These hyperparameters
were used to identify the optimal training sequence. 
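
Coordinate descent with random restarts over a discrete hyperparameter grid can be sketched as follows. This is a generic illustration, not the study's search code: the function name, the grid values, the number of restarts, and the toy objective are all assumptions; in the study the objective would be the human-versus-ANN error measure.

```python
import random
from typing import Callable, Dict, Optional, Sequence, Tuple

Grid = Dict[str, Sequence[int]]
Config = Dict[str, int]


def coordinate_descent(grid: Grid,
                       objective: Callable[[Config], float],
                       restarts: int = 5,
                       seed: int = 0) -> Tuple[Optional[Config], float]:
    """Minimize objective one hyperparameter at a time, from random starts."""
    rng = random.Random(seed)
    best_cfg: Optional[Config] = None
    best_val = float("inf")
    for _ in range(restarts):
        # Random restart: pick a random value for every hyperparameter.
        cfg = {name: rng.choice(list(vals)) for name, vals in grid.items()}
        val = objective(cfg)
        improved = True
        while improved:
            improved = False
            for name, vals in grid.items():   # one coordinate direction
                for v in vals:                # others stay fixed
                    trial = dict(cfg, **{name: v})
                    t = objective(trial)
                    if t < val:
                        cfg, val, improved = trial, t, True
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val


# Hypothetical grid and toy objective with its minimum at (8, 4).
grid = {"embedding_size": [2, 4, 8, 16], "history_window": [1, 2, 4, 8]}


def toy_objective(cfg: Config) -> float:
    return (cfg["embedding_size"] - 8) ** 2 + (cfg["history_window"] - 4) ** 2


best_cfg, best_val = coordinate_descent(grid, toy_objective)
print(best_cfg, best_val)
```

The toy objective is separable, so a single start already finds the minimum; random restarts matter when hyperparameters interact and a single descent can stall in a poor local minimum.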


3.5 Finding an Optimal Training Sequence 
We used the ANN learning algorithm to generate an optimal 
training sequence for the perceptual-fluency problems. In 
Equation 1, we defined the optimization problem to solve.
We solved this problem by searching over the space of
possible training sequences. Without limiting the size of
the training sequence, the search space is infinite, which
makes the search infeasible. To mitigate this issue, we set the size of the
candidate training sequences to 60. This aligns with prior
research on perceptual learning [28]:


$$O = \operatorname*{argmin}_{S \in C_t,\; |S| = 60} \; \mathbb{P}_{(x,y) \sim P_e}\left(A(S)(x) \neq y\right) \tag{4}$$

We used a modified hill climbing algorithm to find such 
an optimal training sequence. Hill-climbing search takes a
greedy approach. Procedurally, we started with one par- 
ticular training sequence. Then, we evaluated neighbors 
of that particular training sequence to determine whether 
a better one existed. If so, we moved to that one. This 
process stopped when no such neighbors were found. This 
search algorithm is defined with its states and neighborhood 
definition: 


• States: Any training sequence $S \in C_t$ of size 60.


• Initial State: A training sequence selected by a
domain expert.


• Neighborhood of S: Any training sequence that
differs from S by one problem is a neighbor. For
computational efficiency, we restricted ourselves to
inspecting only 500 neighbors for a given training sequence.
We did so by first selecting a problem in S uniformly at
random. Then we replaced the selected problem with each of
500 randomly selected problems with the same answer
(i.e., same y value). This made our search algorithm
stochastic.
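
The search defined by these states and neighborhoods can be sketched as below. This is a hedged illustration rather than the study's code: `risk` is a hypothetical callable standing in for "train the ANN on the sequence and estimate its error on the test distribution", the step budget `max_steps` is an assumption, and the toy molecules and toy risk exist only to make the example runnable.

```python
import random


def hill_climb(initial, candidates, risk, n_neighbors=500, max_steps=1000, seed=0):
    """Stochastic hill climbing over fixed-length training sequences.

    initial:    list of (x1, x2, y) problems (the starting sequence)
    candidates: pool of (x1, x2, y) problems that may be swapped in
    risk:       callable mapping a sequence to an error estimate (lower is better)
    """
    rng = random.Random(seed)
    current = list(initial)
    current_risk = risk(current)

    # Group candidate problems by answer so replacements keep the same y value.
    by_answer = {}
    for problem in candidates:
        by_answer.setdefault(problem[2], []).append(problem)

    for _ in range(max_steps):
        # Pick one position uniformly at random and sample replacements for it.
        pos = rng.randrange(len(current))
        pool = by_answer.get(current[pos][2], [])
        replacements = rng.sample(pool, min(n_neighbors, len(pool)))

        best_seq, best_risk = None, current_risk
        for replacement in replacements:
            neighbor = current[:pos] + [replacement] + current[pos + 1:]
            neighbor_risk = risk(neighbor)
            if neighbor_risk < best_risk:
                best_seq, best_risk = neighbor, neighbor_risk

        if best_seq is None:
            break  # no sampled neighbor improved on the current sequence
        current, current_risk = best_seq, best_risk
    return current, current_risk


# Toy setup (assumption): molecules are integers, risk is an arbitrary score.
molecules = range(6)
candidates = [(a, b, int(a == b)) for a in molecules for b in molecules]
toy_risk = lambda seq: sum(a + b for a, b, _ in seq)
initial = [(5, 5, 1), (4, 5, 0), (3, 3, 1), (5, 2, 0)]
final, final_risk = hill_climb(initial, candidates, toy_risk, n_neighbors=3)
print(final, final_risk)
```

Since only one position is examined per step, stopping at the first step without improvement is a simplification; a time or step budget is an equally reasonable stopping rule.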


4. HUMAN EXPERIMENT 


Our main goal was to evaluate whether the optimal train- 
ing sequence yields higher learning outcomes. To this end, 
we conducted a randomized, controlled experiment with hu- 
mans. Here, we discuss our experimental setup and associ- 
ated results. 


Table 2: Hyper-parameters for the ANN learning algorithm


4.1 Participants 

We recruited 368 participants using Amazon’s Mechanical 
Turk (MTurk). Among them, 216 were male and 131
were female. The rest did not disclose their gender. Most
of the participants were below the age of 45 (86%) and the
greatest number (192) fell in the age group 24-35. Among
the 95.4% who disclosed their knowledge about chemistry, 
45.7% had taken an undergraduate-level chemistry class. 


4.2 Test Set 


Because our goal was to assess transfer of learning from the 
training sequence to a novel test set, we chose training and 
test problems from separate distributions. Hence, we
randomly divided the 142 molecules that we selected for this
experiment into two sets of 71 (training molecules, $\mathcal{X}_t$,
and test molecules, $\mathcal{X}_e$). One of the sets was used to
create the test distribution, whereas the other one was used
to create the training distribution. We now describe in more
detail how we created the test distribution $P_e$, because our
goal was to reduce humans' error rates on the test set. We
used the following procedure.


• Draw $x_1 \sim p_1$, where $p_1$ is a marginal distribution on
$\mathcal{X}_e$. $p_1$ is the “importance of molecule x to chemistry
education” and was constructed by manually searching a corpus
of chemistry education articles for molecule text frequency.


• With probability 1/2, set $x_2 = x_1$ so that the true
answer y = 1.


• Otherwise, draw $x_2 \sim p_2(\cdot \mid x_1)$. The conditional
distribution $p_2$ is based on domain experts’ opinion that
favors confusable $x_1, x_2$ pairs in an education setting.
Also note that $p_2(x_1 \mid x_1) = 0, \forall x_1$. Taken together,


$$P_e(x_1, x_2) = \frac{1}{2} p_1(x_1) \mathbb{1}_{\{x_1 = x_2\}} + \frac{1}{2} p_1(x_1) p_2(x_2 \mid x_1).$$


Both the pretest and posttest judgment problems were sam- 
pled from this distribution across all conditions. 
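
The sampling procedure above can be sketched in a few lines. The molecule names and probability values below are made up for illustration (the actual importance weights and expert confusability judgments are not reproduced here), and the function names are hypothetical.

```python
import random


def sample_problem(p1, p2, rng):
    """Draw one judgment problem (x1, x2, y) following the three steps above.

    p1: dict molecule -> marginal probability
    p2: dict x1 -> dict x2 -> conditional probability (x1 itself is omitted,
        which encodes p2(x1 | x1) = 0)
    """
    molecules = list(p1)
    x1 = rng.choices(molecules, weights=[p1[m] for m in molecules])[0]
    if rng.random() < 0.5:
        return x1, x1, 1  # same molecule, true answer y = 1
    others = list(p2[x1])
    x2 = rng.choices(others, weights=[p2[x1][m] for m in others])[0]
    return x1, x2, 0  # different molecule, true answer y = 0


def joint_probability(p1, p2, x1, x2):
    """Evaluate the mixture formula for the joint probability of a pair."""
    same = 0.5 * p1[x1] if x1 == x2 else 0.0
    return same + 0.5 * p1[x1] * p2[x1].get(x2, 0.0)


# Illustrative values only (assumption).
p1 = {"H2O": 0.5, "CO2": 0.3, "NH3": 0.2}
p2 = {
    "H2O": {"CO2": 0.25, "NH3": 0.75},
    "CO2": {"H2O": 0.5, "NH3": 0.5},
    "NH3": {"H2O": 0.75, "CO2": 0.25},
}

rng = random.Random(0)
problems = [sample_problem(p1, p2, rng) for _ in range(20)]
total = sum(joint_probability(p1, p2, a, b) for a in p1 for b in p1)
print(problems[:3], total)
```

Summing the joint probability over all pairs gives 1, which is a quick sanity check that the two mixture components are weighted consistently.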


4.3 Experimental Design 
We compared three training conditions: 


1. In the machine training sequence condition, we used
the optimal training sequence O found by the modified
hill climb search algorithm. For all $(x_1, x_2) \in O$
(here $x_1, x_2 \in \mathcal{X}_t$), the corresponding true answer
y was the indicator variable on whether $x_1$ and $x_2$ were
the same molecule: $y = \mathbb{1}_{\{x_1 = x_2\}}$. We presented $x_1$ and
$x_2$ in Lewis and space-filling representations to the
human participants, respectively. Participants gave their
binary judgment $\hat{y} \in \{0, 1\}$. We then provided the
true answer y as feedback to the participant.


2. In the human training sequence condition, the training
sequence was constructed by a domain expert using
perceptual learning principles (using molecules only
from $\mathcal{X}_t$). Specifically, an expert on perceptual
learning constructed the sequence based on the contrasting
cases principle [r9][30], so that consecutive examples
emphasized conceptually meaningful visual features,
such as the color of spheres that show atom identity
or the number of dots that show electrons. The rest
of this condition was the same as the machine training
sequence condition. This training sequence is identical
to the initial state of the modified hill climb search
algorithm that we used to generate the machine training
sequence.


3. In the random training sequence condition, each
training problem $(x_1, x_2)$ was selected from the training
distribution $P_t$ with $y = \mathbb{1}_{\{x_1 = x_2\}}$. The training
distribution $P_t$ for this condition was induced in the same
manner as the test distribution $P_e$ but on the set of
training molecules $\mathcal{X}_t$. The rest of the condition was
the same as the previous ones.


| Parameter name | Values explored | Best value |
|---|---|---|
| Embedding size | 1, 2, 4, 8, 16 | 2 |
| Learning rate | 0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1 | 0.0001 |
| History window size | 0, 1, 2, 4, 8, 16, 32, 60 | 2 |
| Backprop count | 1, 2, 4, 8, 16 | 2 |
| Number of hidden layers before embedding layer | 0, 1, 2, 3, 4 | 0 |
| Number of hidden units in each column | 10, 20, 40, 80, 160 | N/A |


4.4 Procedure 

We hosted the experiment on the Qualtrics survey plat- 
form using NEXT [18]. Participants first received a 
brief description of the study and then completed a sequence 
of 126 judgment problems (yes or no). The problems were 
divided into three phases as follows. First, participants
received a pretest that included 20 test problems without
feedback. Second, participants received the training, which
included 60 training problems sequenced according to
their experimental condition. During this phase, correctness 
feedback was provided for submitted answers. We assumed 
that participants learned during this phase because they re- 
ceived feedback. Third, participants received a posttest that 
included 40 test problems without feedback. In addition, one 
guard problem was inserted after every 19 problems through- 
out all three phases. A guard question either showed two 
identical molecules depicted by the same representation or 
two highly dissimilar molecules depicted by Lewis structures. 
We used these guard questions to filter out participants who
clicked through the problems haphazardly. In our main anal- 
yses, we disregarded the guard problems. So that no visual 
representation was privileged, we randomized their positions 
(left vs. right). 
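
The problem counts in this procedure can be checked with a short sketch. The exact positions of the guard problems are an assumption here (one guard immediately after every 19th non-guard problem); the function name and phase labels are hypothetical.

```python
def build_schedule(n_pretest=20, n_training=60, n_posttest=40, guard_every=19):
    """Return the ordered list of problem types seen by one participant."""
    phases = (
        ["pretest"] * n_pretest
        + ["training"] * n_training
        + ["posttest"] * n_posttest
    )
    schedule = []
    for count, phase in enumerate(phases, start=1):
        schedule.append(phase)
        # Assumed placement: a guard follows every 19th regular problem.
        if count % guard_every == 0:
            schedule.append("guard")
    return schedule


schedule = build_schedule()
print(len(schedule), schedule.count("guard"))
```

Under this placement rule, the 120 regular problems plus 6 guard problems give the 126 judgment problems reported above.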


4.5 Results 
Of the 368 participants, we excluded 43 participants who
failed any of the guard questions. The final sample size was
N = 325. The final numbers of participants in the random,
human, and machine training sequence conditions were
108, 117, and 100, respectively. Table 3 reports accuracy on
the pretest, training set, and posttest. See Figure 5 for a
graphical depiction of the same data.

Table 3: Accuracy by Training Condition. Average pretest, 
training and posttest accuracy with SEM in parentheses. 


| Condition | Pretest | Training | Posttest |
|---|---|---|---|
| Machine | 69.5 (1.1) | 63.9 (1.1) | 74.7 (1.1) |
| Human | 71.3 (1.3) | 72.4 (1.0) | 71.7 (1.0) |
| Random | 69.4 (1.1) | 70.3 (1.1) | 71.1 (1.1) |


4.5.1 Effects of condition on training accuracy 

First, we tested whether training condition affected partic- 
ipants’ accuracy during training. To this end, we used an 
ANCOVA (Analysis of COVAriance) with condition as the 
independent factor and training accuracy as the dependent 
variable. Because pretest accuracy was a significant pre- 
dictor of training accuracy, we included pretest accuracy 
as the covariate. Results showed a significant main effect 
of condition on training accuracy, F(2, 321) = 18.8, p <
.001, $\eta^2$ = .082. Tukey post-hoc comparisons revealed that
(a) the machine training sequence condition had significantly
lower training accuracy than the human training sequence
condition (p < .001, d = -0.32), (b) the machine training
sequence condition had significantly lower training
accuracy than the random training sequence condition (p <
.001, d = -0.26), and (c) no significant differences existed
between the human and random training sequence conditions
(p = .592, d = 0.05). In other words, during the training
phase, the human and random training sequences yielded
comparable accuracy, whereas the machine training
sequence yielded lower accuracy.
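
To make the structure of this test concrete, the sketch below computes a one-covariate ANCOVA F statistic by comparing a full model (one intercept per condition, common slope) with a reduced model (single intercept, same slope term). It is an illustration only: the data are synthetic placeholders, not the study's data, and the function name is hypothetical. The group sizes match the reported sample so that the degrees of freedom can be checked.

```python
def ancova_f(groups):
    """groups: list of lists of (covariate, outcome) pairs, one list per condition.

    Returns (F, df1, df2) for the condition effect adjusted for the covariate.
    """
    def centered_sums(pairs):
        n = len(pairs)
        mean_x = sum(x for x, _ in pairs) / n
        mean_y = sum(y for _, y in pairs) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        syy = sum((y - mean_y) ** 2 for _, y in pairs)
        return sxx, sxy, syy

    k = len(groups)
    n_total = sum(len(g) for g in groups)

    # Full model: residual sum of squares from pooled within-group sums.
    within = [centered_sums(g) for g in groups]
    sxx_w = sum(s[0] for s in within)
    sxy_w = sum(s[1] for s in within)
    syy_w = sum(s[2] for s in within)
    rss_full = syy_w - sxy_w ** 2 / sxx_w

    # Reduced model: ignore condition, regress outcome on the covariate only.
    sxx_t, sxy_t, syy_t = centered_sums([p for g in groups for p in g])
    rss_reduced = syy_t - sxy_t ** 2 / sxx_t

    df1, df2 = k - 1, n_total - k - 1
    f_stat = ((rss_reduced - rss_full) / df1) / (rss_full / df2)
    return f_stat, df1, df2


# Synthetic placeholder data with the reported group sizes (108, 117, 100).
sizes = (108, 117, 100)
toy = [[(i % 7, (3 * i) % 5 + g) for i in range(n)] for g, n in enumerate(sizes)]
f_stat, df1, df2 = ancova_f(toy)
print(df1, df2)
```

With three conditions, one covariate, and N = 325, the degrees of freedom are 2 and 321, matching the F(2, 321) reported above.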


[Figure 5 plot: accuracy (%) by assessment (Pretest, Training, Posttest) for the Machine, Human Expert, and Random conditions]


Figure 5: Learning progress between conditions revealed 
an initial disadvantage, but ultimate advantage for the 
machine-generated sequence. 


4.5.2 Effects of condition on posttest accuracy 

Next, we tested whether training condition affected partici- 
pants’ posttest accuracy. To this end, we conducted an AN- 
COVA with condition as the independent factor and posttest 
accuracy as the dependent variable. Because pretest ac- 
curacy was a significant predictor of posttest accuracy, we 
included pretest accuracy as a covariate. Results showed
a significant main effect of condition on posttest accuracy,
F(2, 321) = 5.02, p < .01, $\eta^2$ = .023. Tukey post-hoc
comparisons revealed that (a) the machine training sequence
condition had significantly higher posttest accuracy than the
human training sequence condition (p < .05, d = 0.16), (b)
the machine training sequence condition had significantly
higher posttest accuracy than the random sequence
condition (p < .05, d = 0.14), and (c) no significant differences
existed between the human and random training sequence
conditions (p = .960, d = -0.02). In other words, the
human and random training sequences were equally effective
and the machine training sequence was most effective.


5. DISCUSSION 


Our goal was to investigate whether a novel educational data 
mining approach can help identify a training sequence of vi- 
sual representations that enhances students’ learning from 
perceptual-fluency problems. To this end, we applied the 
machine teaching paradigm. It involved gathering data from 
human students learning from perceptual-fluency problems. 
Next, we generated a cognitive model that mimics human 
perceptual learning. We then used the cognitive model to 
reverse-engineer an optimal training sequence for a machine- 
learning algorithm. Finally, we conducted an experiment 
that compared the machine training sequence to a random 
sequence and to a principled sequence generated by a human 
expert on perceptual learning. Results showed that the ma- 
chine training sequence resulted in lower performance during 
training, but greater performance on a posttest. 


These findings make several important contributions to the 
perceptual learning literature. First, our results can in- 
form the instructional design of perceptual-learning prob- 
lems. Even though prior research yields principles for effec- 
tive sequences of visual representations, numerous potential 
sequences can satisfy these principles. Our results show that 
this new educational data mining approach can help address 
this problem. Given a learning algorithm that constitutes 
a cognitive model of students learning a task, instructors 
can identify a sequence of problems that likely yields higher 
learning outcomes. 


Second, our results expand theory on perceptual learning. 
The fact that the machine training sequence yielded lower
performance during training but greater posttest scores
suggests that this sequence induced desirable difficulties
during learning [19][34][40]. The concept of desirable
difficulties describes the common finding that some
instructional techniques yield lower performance during
training, but higher
long-term learning outcomes. To explain this phenomenon, 
Soderstrom and Bjork proposed that more difficult learn- 
ing interventions induce more active processing during train- 
ing. This lowers immediate performance due to the in- 
creased difficulty, but results in more durable memories and 
greater long-term learning. Our findings suggest that the 
machine teaching approach was successful because it iden- 
tified a training sequence that induced desirable difficulties. 
To the best of our knowledge, our study is the first to show 
that an educational data mining approach can be used to 
induce desirable difficulties for perceptual learning. 


Proceedings of the 11th International Conference on Educational Data Mining


Our findings also contribute to the educational data mining
literature. We provide the first empirical evidence that
an ANN learning algorithm constitutes an adequate cognitive
model of learning with visual representations. As far
as we know, the machine teaching paradigm has thus far 
only been applied to learning with artificial visual stim- 
uli that vary on only one or two dimensions (e.g. Gabor 
patches f1a]). Thus, our study provides the first demonstra- 
tion that machine learning along with machine teaching is 
a viable approach to modeling and improving learning with 
realistic, high-dimensional visual representations like Lewis 
structures and space-filling models of chemical molecules. 
Many other domains, such as biology, engineering, and math, also use
high-dimensional visual representations. Therefore, we be- 
lieve this approach is valuable for educational data mining 
research. 


6. LIMITATIONS AND FUTURE DIRECTIONS


Our findings should be interpreted against the background 
of the following limitations. First, the population of MTurk 
workers may limit generalization to the target population 
of undergraduate chemistry students. MTurk workers have 
highly variable prior knowledge about chemistry. As
mentioned previously, only 45.7% of the participants who
disclosed their chemistry background had taken
an undergraduate-level chemistry class. This suggests that
their prior knowledge may have been lower and more di- 
verse than that of a typical undergraduate chemistry stu- 
dent. Hence, we plan to test whether the machine training 
sequence leads to better learning for undergraduate chem- 
istry students. 


Second, the search algorithm we used to find the machine 
training sequence did not test all possible training sequences 
of size 60. As mentioned previously, we only inspected 500 
neighbors (out of a potential $5040 = 71 \times 71 - 1$) for any
given training sequence. Moreover, we stopped the search
algorithm after a predetermined amount of time. We chose
this non-exhaustive approach because exhaustively finding a
solution is not computationally feasible. Thus, we settled 
for a suboptimal training sequence that still yielded a small 
risk on the test distribution. Consequently, it is possible to 
find a better training sequence than the one we used in our 
experiments. 
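
The arithmetic behind this limitation can be made explicit with a short calculation. The sketch assumes that every ordered pair of the 71 training molecules is a candidate problem, which is how the figure of 5040 above is obtained; restricting replacements to problems with the same answer would shrink the pool further. Variable names are illustrative.

```python
n_molecules = 71
seq_len = 60
inspected = 500

# Every ordered pair of training molecules is treated as a candidate problem.
n_pairs = n_molecules * n_molecules
# Replacing one problem can use any pair except the one already in place.
replacements_per_position = n_pairs - 1
# Share of the possible replacements that the search actually inspected.
fraction_inspected = inspected / replacements_per_position
# Number of distinct length-60 sequences, if repeats are allowed.
n_sequences = n_pairs ** seq_len

print(replacements_per_position)
print(round(fraction_inspected, 3))
print(len(str(n_sequences)), "digits in the number of possible sequences")
```

Under these assumptions the search inspected roughly a tenth of the possible single-problem replacements at each step, while the full space of sequences is far too large to enumerate.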


Third, while determining the hyperparameters of the ANN 
learning algorithm such that it mimics human perceptual 
learning, we only searched over a subset of all possible hy- 
perparameters. As a result, it is possible that a better set of 
hyperparameters exists. Our study was also limited in that 
we did not account for individual prior knowledge. Hence, 
future research needs to investigate how to expand the ap- 
proach presented in this paper to modeling individual prior 
knowledge (e.g., for adaptive teaching or personal training). 


A fourth limitation of the present experiments is that our 
study was constrained in the use of chemistry representa- 
tions as stimuli. While we used realistic representations that 
are more high-dimensional than prior perceptual learning 
studies [9][11][35] and that are more representative of
commonly used visual representations in a variety of STEM
domains, the complexity of the representations we considered
does not reflect all realistic stimuli. Still, we see no reason
why this approach could not be applied to other
representations in other domains. Sparser and richer visuals
exist, and it is possible that machine teaching may yield
greater benefits for sparser visuals. We will investigate this
hypothesis in future studies.


7. CONCLUSION 


This paper advanced a novel educational data mining
approach to identify optimal sequences of visual
representations for perceptual-fluency problems. Students’
difficulties in learning with visual representations are partly
due to a lack of perceptual fluency. This increases the
cognitive demands of learning with visuals. Perceptual-fluency
problems are a relatively novel type of instructional
intervention that can aid learning from visuals by freeing up
cognitive resources for higher-order complex reasoning. Thus
far, we have lacked a principled approach capable of
identifying effective sequences of visual representations. Our
educational data mining approach relied solely on students’
responses to perceptual-fluency problems to select a sequence
of visuals that is effective for a machine learning algorithm
mimicking human perceptual learning. Our results showed that
this approach is more effective than conventional perceptual
fluency instruction. Further, the effectiveness of our approach
lies in its ability to induce desirable difficulties. Given the
pervasiveness of visual representations in STEM domains,
we anticipate that our findings will be broadly useful for
students’ learning with visual representations. We also plan
to investigate how the machine-generated sequence induced
desirable difficulties in the humans.